Goals 


* Study several approaches to storing and retrieving NASA HDF5 (& netCDF4) data using Amazon Web Services (AWS) Simple Storage Service (S3) and the Hyrax server.


* Explore strategies for granulizing and aggregating data that optimize both performance and cost for data storage and retrieval.


* Develop a cloud cost model for the preferred data storage solution that accounts for different granulation and aggregation schemes as well as cost and performance trade-offs.


 EOSDIS 


Unrestricted Content 


Methodology 


* Three architectures explored using proof-of-concept code
* Three sample NASA data collections
* Index files with dataset storage information
* HDF5 library, h5py (Python), Hyrax
* Representative use cases with NASA data
* Analysis of performance, access and cost logs
Archit. #1: Baseline Hyrax Data Access 




Archit. #2: Files With HTTP Range-Gets 


[Diagram: Request / Response flow, with Per-Variable Range Access Info]


* Index Location TBD 
* Read metadata (DMR/DDS/DAS, coordinates?) from index. 
* Read data from S3 object using HTTP range GET. 
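
The range read in the last bullet can be sketched as follows. This is a minimal illustration, assuming the index supplies a byte offset and length for each chunk; the function names, object URL, and numbers are hypothetical and are not taken from the proof-of-concept code.

```python
import urllib.request


def range_header(offset: int, nbytes: int) -> str:
    """Build an HTTP Range header value for nbytes starting at offset.

    HTTP byte ranges are inclusive at both ends, hence the -1.
    """
    if offset < 0 or nbytes <= 0:
        raise ValueError("offset must be >= 0 and nbytes must be > 0")
    return f"bytes={offset}-{offset + nbytes - 1}"


def build_range_request(url: str, offset: int, nbytes: int) -> urllib.request.Request:
    """Prepare (but do not send) a GET request for one byte range of an object."""
    return urllib.request.Request(url, headers={"Range": range_header(offset, nbytes)})


# Hypothetical object URL, offset, and length.
req = build_range_request("https://example-bucket.s3.amazonaws.com/granule.h5", 1024, 4096)
# Sending it with urllib.request.urlopen(req) would return only the requested bytes.
```

Building the request separately from sending it keeps the offset arithmetic easy to check without network access.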
Archit. #3: HDF5 Datasets as S3 Objects 


Access Info 


* Index Location TBD 
* Read metadata (DMR/DDS/DAS, coordinates?) from index. 
* Read data shard from S3 using simple GET. 
Sample Data Collections in AWS S3 


* NASA MERRA2 (7,466 HDF5/netCDF files; 231GB)
* NASA AIRSSTD3.006 2015 (365 HDF5 files; 122GB)
* Sample HDF5 (96 files; 125MB)
Index Files 


* A catalog of file content and dataset byte storage information.


* One for each file in the data collections.


* Hyrax Dataset Metadata Response (DMR) XML used as the ST.


* HDF4 File Map XML used for HDF5 dataset storage information (chunk sizes and offsets).


* Prototyped HDF5 Library to provide this data storage information; the tool reads and modifies the DMR.
Index File Content 


<?xml version="1.0" encoding="UTF-8"?>
<Dataset xmlns="http://xml.opendap.org/ns/DAP/4.0#"
xmlns:h4="http://www.hdfgroup.org/HDF4/XML/schema/HDF4map/1.0.1"
dapVersion="4.0" dmrVersion="1.0">
<Dimension name="Latitude" size="180"/>
<Dimension name="Longitude" size="360"/>
<Float32 name="ClrOLR_A">
<Dim name="/Latitude"/>
<Dim name="/Longitude"/>
<h4:chunks …>
… uuid="b0abe13…f2049" md5="…670ae423d0fda9fdb6f33e8f186c" chunkPositionInArray="[0,0]" offset="130440821" …
</h4:chunks>
</Float32>
</Dataset>
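
A reader for an index file like the one above could look roughly like the sketch below. It is a minimal illustration only: the `h4:byteStream` element name, the `nBytes` attribute, and all offsets in the toy document are assumptions made for the example, since the slide shows only part of one chunk entry.

```python
import xml.etree.ElementTree as ET

# Namespace URI as declared in the index file; the prefix is only a local alias.
NS = {"h4": "http://www.hdfgroup.org/HDF4/XML/schema/HDF4map/1.0.1"}

# Toy index file. Element name h4:byteStream, nBytes, and all values are assumed.
SAMPLE_INDEX = """\
<Dataset xmlns="http://xml.opendap.org/ns/DAP/4.0#"
         xmlns:h4="http://www.hdfgroup.org/HDF4/XML/schema/HDF4map/1.0.1"
         dapVersion="4.0" dmrVersion="1.0">
  <Float32 name="ClrOLR_A">
    <h4:chunks>
      <h4:byteStream chunkPositionInArray="[0,0]" offset="1000" nBytes="200"/>
      <h4:byteStream chunkPositionInArray="[0,180]" offset="1200" nBytes="150"/>
    </h4:chunks>
  </Float32>
</Dataset>
"""


def chunk_ranges(index_xml, variable):
    """Return [(chunk position, byte offset, byte count), ...] for one variable."""
    root = ET.fromstring(index_xml)
    for elem in root.iter():
        if elem.get("name") != variable:
            continue
        return [
            (bs.get("chunkPositionInArray"), int(bs.get("offset")), int(bs.get("nBytes")))
            for bs in elem.iterfind("h4:chunks/h4:byteStream", NS)
        ]
    return []  # variable not present in the index


ranges = chunk_ranges(SAMPLE_INDEX, "ClrOLR_A")
```

Each returned tuple carries what a server would need to issue one range GET (Architecture #2) or to locate one chunk object (Architecture #3).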
UUIDs vs. Checksums 


Unique chunks 


Redundant chunks 


Using UUIDs as object identifiers seems like a good choice. 


Analysis of the checksums identified identical chunks (same checksum) repeated in every file in 
a dataset. These chunks can account for a significant portion of the datasets (30-90%). 


Storing these chunks once could decrease storage and access costs significantly. 
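
The idea of storing repeated chunks once can be sketched by keying chunks on their checksum. This is an illustrative sketch, not the project's implementation: MD5 is assumed as the checksum, and the chunk contents are made up.

```python
import hashlib


def dedup_chunks(chunks):
    """Store each distinct chunk once, keyed by its MD5 checksum.

    Returns (store, refs): store maps checksum -> bytes, and refs lists the
    checksum of every input chunk in order so the originals can be rebuilt.
    """
    store = {}
    refs = []
    for data in chunks:
        digest = hashlib.md5(data).hexdigest()
        store.setdefault(digest, data)  # keep only the first copy
        refs.append(digest)
    return store, refs


def redundant_fraction(chunks):
    """Fraction of chunk bytes that duplicate an already stored chunk."""
    store, _ = dedup_chunks(chunks)
    total = sum(len(c) for c in chunks)
    unique = sum(len(c) for c in store.values())
    return 1 - unique / total if total else 0.0


# Made-up data: the same fill-value chunk appears in three "files".
fill = b"\x00" * 8
chunks = [fill, b"file1dat", fill, b"file2dat", fill, b"file3dat"]
store, refs = dedup_chunks(chunks)
rebuilt = [store[r] for r in refs]
```

Keeping the ordered list of checksums alongside the store is what allows each file's chunk sequence to be reassembled after deduplication.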
Use Cases 


CF responses: Access the CF-enabled DAP4 Hyrax DMR and Data responses. 

Default responses: Access the default DAP4 Hyrax DMR and Data responses. 

Timeseries request: one pixel MERRA2 files, Query: PRECCU[0:1:*][1][1] 

Timeseries request: one pixel AIRS files, Query: Temperature_A[0:1:*][1][1][1] 

CF responses (contiguous): Access the CF-enabled DAP4 DMR, Data responses, HDF5 w/ contiguous storage. 
2/8 chunks spatial subset: AIRS files, Query: Temperature_A[0:1:*][13:1:15][40:1:45][175:1:195] 
Decimated variable: AIRS files, Query: Temperature_A[0:1:*][0:8:23][0:15:179][0:15:359] 

4/16 chunks spatial subset: MERRA2 files, Query: PRECCU[0:1:*][160:1:200][245:1:295] 

Decimated variable: MERRA2 files, Query: PRECCU[0:1:*][0:60:360][0:8:575][0:15:359] 

Random spatial subset(10): MERRA2 files, Query: PRECCU[0:1:*][160:1:199][245:1:294] 

Random spatial subset(10): AIRS files, Query: Temperature_A[0:1:*][0:8:23][0:1:39][0:1:49] 

Random spatial subset(all datasets): MERRA2 files, Query: PRECCU[0:1:*][160:1:199][245:1:294] 
Random spatial subset(all datasets): AIRS files, Query: Temperature_A[0:1:*][0:8:23][0:1:39][0:1:49] 
Decimated variable(17): MERRA2 files, Query 

Decimated variable(30): AIRS files, Query 

All data: 100 MERRA2 files, Query: none 

All data: 100 of the AIRS files, Query: none 
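
The queries above use `[start:stride:stop]` index ranges with an inclusive stop, where `*` means "through the last index". The sketch below counts how many elements each range selects. The function names are hypothetical, and the dimension sizes passed in are assumptions inferred from the stop indices in the decimated AIRS query (23, 179, 359).

```python
import re


def parse_slabs(query):
    """Split 'VAR[a:b:c][d]...' into (name, [(start, stride, stop), ...]).

    stop is None when the query uses '*' (through the last index).
    """
    name = query.split("[", 1)[0]
    slabs = []
    for text in re.findall(r"\[([^\]]+)\]", query):
        parts = text.split(":")
        if len(parts) == 1:      # [index]
            start, stride, stop = parts[0], "1", parts[0]
        elif len(parts) == 2:    # [start:stop]
            start, stride, stop = parts[0], "1", parts[1]
        elif len(parts) == 3:    # [start:stride:stop]
            start, stride, stop = parts
        else:
            raise ValueError(f"unsupported index range: [{text}]")
        slabs.append((int(start), int(stride), None if stop == "*" else int(stop)))
    return name, slabs


def slab_counts(query, dim_sizes):
    """Number of elements selected along each dimension (stop is inclusive)."""
    _, slabs = parse_slabs(query)
    counts = []
    for (start, stride, stop), size in zip(slabs, dim_sizes):
        last = size - 1 if stop is None else stop
        counts.append((last - start) // stride + 1)
    return counts


# Assumed dimension sizes: 24 x 180 x 360.
decimated = slab_counts("Temperature_A[0:8:23][0:15:179][0:15:359]", (24, 180, 360))
```

Comparing these counts with the chunk shape gives a quick estimate of how many chunks a request touches, which is what separates the subset and decimation use cases.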
Request Tracers 


[Diagram labels: User, Development Team, OPeNDAP Request, Tag, Access, Cost]


cloudydap={UseCase}_{Arch}_STARTED_{seconds-since-epoch}.h5 where: 

* {UseCase} is the use case identifier, e.g. UC1 for Use Case 1. 

* {Arch} is the architecture identifier, e.g. A1CFT for the Architecture #1 CF=True Hyrax server. 

* {seconds-since-epoch} is replaced with the output of a date +%s command (e.g. 1485208202), which should be the same for every request in a particular run of a collection of the use cases. 


Example: UC1_A1CFT_STARTED_1485208202.h5 
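
A tracer of this form could be generated as in the sketch below. It assumes underscores separate the fields, as in the example name, and it assumes a `CFF` suffix for CF=False, which the slide does not show. The function name is hypothetical.

```python
import time


def tracer_name(use_case: int, arch: int, cf: bool, started: int) -> str:
    """Build the tracer file name for one request.

    The 'F' suffix for CF=False is an assumption; only A1CFT appears in the source.
    """
    arch_id = f"A{arch}CF{'T' if cf else 'F'}"
    return f"UC{use_case}_{arch_id}_STARTED_{started}.h5"


# Take the timestamp once per run so every request in the run shares it
# (the Python equivalent of capturing `date +%s` a single time).
run_started = int(time.time())
tracers = [f"cloudydap={tracer_name(uc, 1, True, run_started)}" for uc in (1, 2, 3)]
```

Capturing the timestamp once, outside the loop, is what lets all log entries from one run be grouped by the same tag.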
Performance / Costs 


Architecture #1 
DAP request = S3 request = HDF5 file 


Architecture #2 
Cache is not required 
One DAP request = one or more S3 
range gets / HDF5 dataset 


Architecture #3 
One DAP request = one or more 
S3 requests / HDF5 dataset 
S3 Processing Times 


Architecture 1 retrieves entire files — bigger files take longer 


Architecture 2/3 retrieve chunks 
small chunks are fast 


[Plots: Architecture 1 (all use cases), total S3 response time (ms) vs. bytes sent (bytes); Use Case UC11 (MERRA2), Total_Time_Ms; annotation: 112/165 Mb]


User response times also include connection initiation times. 
Range-gets vs. Objects 


[Plots: Total_Time_Ms for Use Cases UC16, UC17, UC18 and UC19]


More details in notebook 


Performance Summary 


Data in the Cloud Performance Comparison 


C/F = Chunks / File = the approximate number of chunks that can be retrieved in the same time as a complete file. 


HTTP connection = ~225 ms 
Cost Modelling 


* Fixed costs 
  - EC2 instances (Hyrax servers) ($$$$) 
  - Data / Metadata (?) in S3 ($$$$) 

* Dynamic costs 
  - Number of Hyrax requests to S3 (¢) 
  - Outbound data ($$$) 
  - Cache type and size ($$$) 
  - Data flow from S3 to Hyrax server(s) ($$ if not in the same AWS region) 
Cost Comparison 


S3 Storage: $0.022/GB-month @ ~25% discount?? 


S3 Request: $0.0000004 per request 


EFS cache (NFS on AWS): $0.3/GB-month 


EBS cache (local disk): $0.1/GB-month 


One month of EC2 m4.xlarge instance: $156 
= 7/13 GB of storage 
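
The unit prices above can be combined into a rough monthly estimate, as in the sketch below. The prices are the ones on this slide; the function name, the request count, and the cache size are hypothetical. Outbound data and cross-region transfer are left out because the slide gives no price for them.

```python
# Unit prices as listed on the slide (USD).
S3_STORAGE_PER_GB_MONTH = 0.022
S3_PER_REQUEST = 0.0000004
CACHE_PER_GB_MONTH = {"efs": 0.3, "ebs": 0.1}
EC2_M4_XLARGE_PER_MONTH = 156.0


def monthly_cost(stored_gb, s3_requests, cache_gb=0.0, cache="ebs", instances=1):
    """Break a month's cost into storage, requests, cache, and EC2 components."""
    parts = {
        "storage": stored_gb * S3_STORAGE_PER_GB_MONTH,
        "requests": s3_requests * S3_PER_REQUEST,
        "cache": cache_gb * CACHE_PER_GB_MONTH[cache],
        "ec2": instances * EC2_M4_XLARGE_PER_MONTH,
    }
    parts["total"] = sum(parts.values())
    return parts


# MERRA2 sample collection (231 GB) with an assumed one million S3 requests
# and an assumed 100 GB EBS cache on a single server.
estimate = monthly_cost(stored_gb=231, s3_requests=1_000_000, cache_gb=100)
```

With these inputs the EC2 instance dominates the total, which is consistent with the slide listing it among the largest fixed costs.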
Architecture Comparison 


Performance 
* Architecture 1: Faster for requests for a large number of variables or the entire file. Slower for requests for a small number of variables. 
* Architectures 2 and 3: Faster than A1 for requests accessing a small number of variables. Slower for requests for many variables or the entire granule. 

Processing Costs 
* All architectures: Depends on processing time. 

Storage Costs 
* Architectures 1 and 2: ~Equal. 
* Architecture 3: Can be significantly lower, depending on repetition of data values within the granules / dataset. 

Original granule retrieval 
* Architectures 1 and 2: Yes. 
* Architecture 3: No. 

Data Migration to Cloud 
* Architectures 1 and 2: Copy each file to a single S3 object. 
* Architecture 3: Shred each file into multiple S3 objects. 

Commercial Web Crawler Access (GoogleBot, etc.) 
* All architectures: Potentially significant if crawlers require large amounts of information to move from S3 to the data server. This situation can be mitigated by limiting crawler access to just metadata held by the server. 
Recommendations 


* Integrate support for S3 access into the HDF5 Library 
* Model current DAAC data use with detailed / consistent Hyrax logs 
* Model and mitigate web crawler costs 
* Develop an adaptable server: retrieval strategy depends on nature of request 
* Refine the implementations of A1, A2 and A3 (parallel requests, reuse connections, ...) 
* Explore deployment utilizing a serverless architecture 
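
One way an adaptable server might choose a retrieval strategy is to compare the number of chunks a request needs against the C/F ratio from the performance summary. The sketch below is only an illustration of that idea; the decision rule, function name, and the C/F value of 20 are assumptions, not results from the study.

```python
def choose_strategy(chunks_needed: int, chunks_per_file: float) -> str:
    """Pick 'whole_file' or 'chunks' for one request.

    chunks_per_file is the C/F ratio: roughly how many chunks can be retrieved
    in the time it takes to retrieve the complete file. If the request needs
    at least that many chunks, fetching the whole file should be no slower.
    """
    if chunks_needed <= 0:
        raise ValueError("chunks_needed must be positive")
    return "whole_file" if chunks_needed >= chunks_per_file else "chunks"


# Hypothetical C/F ratio of 20 for a collection.
small_request = choose_strategy(chunks_needed=2, chunks_per_file=20)
large_request = choose_strategy(chunks_needed=64, chunks_per_file=20)
```

A real server would also need to weigh request cost and cache state, which this rule ignores.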